Meenakshi Nerolu, December 15, 2019
A decision tree is one of the most frequently and widely used supervised machine learning algorithms; it can perform both regression and classification tasks.
Decision trees offer several advantages for predictive analysis: they are easy to interpret, require little data preparation, and can handle both numerical and categorical features.
In this project, we will implement the decision tree algorithm using Python's Scikit-Learn library.
This dataset is taken from https://www.kaggle.com/primaryobjects/voicegender#voice.csv. It was created to identify a voice as male or female based on acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples collected from male and female speakers.
The acoustic properties of voice and speech can be used to detect the gender of a speaker. Humans have a natural ability to hear the difference, but a computer must be taught: we provide inputs, a methodology, and training data, and let it learn. In this project, the focus is on training a computer to identify gender from acoustic attributes using machine learning algorithms and obtaining the best results. Speech is the most common means of communication, so recorded speech serves as the input to the system, which processes the signal to extract the acoustic attributes.
The acoustic properties of each voice are measured and included in the CSV file.
# Python Libraries and Packages
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#data frame from csv file
voice = pd.read_csv("voice.csv")
print("Dimension of the data set is :", voice.shape)
# Making new data frame with dropped NA values
voice_new=voice.dropna()
# dimension of the new dataframe
print("Dimension of the new data set is :", voice_new.shape)
voice=voice.rename(columns={"label": "gender"})
voice.head(3)
# Encode the labels numerically: male -> 1, female -> 0
# (mapping in one step avoids pandas' chained-assignment warnings)
voice["gender"] = voice["gender"].map({"male": 1, "female": 0}).astype("int")
!pip install plotly
!pip install cufflinks
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import cufflinks as cf
cf.go_offline()
#voice.iplot(kind='hist')
sns.pairplot(voice, kind='scatter')
Decision trees are highly interpretable and tend to perform well on classification problems.
import sklearn as sk
from sklearn import metrics, preprocessing, neighbors, tree, model_selection
from sklearn.model_selection import train_test_split
# Split the dataset into features and target variable
col_names = ['gender', 'meanfun', 'IQR', 'Q25', 'sd', 'sp.ent']
voice = voice[col_names]
features = np.array(voice.drop(['gender'], axis=1))
target = np.array(voice['gender'])
Let's split the dataset using the train_test_split() function.
#Dividing the data randomly into training and test set
X_train, X_test, y_train, y_test = train_test_split(features,target,
test_size=0.2,random_state=0)
After splitting, 80% of the data will be used for model training and 20% for model testing.
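As a side note, when the classes are imbalanced it can help to pass stratify so the train and test splits preserve the class ratio. A minimal sketch on synthetic data (the dataset and parameters here are illustrative, not from this project):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 70% / 30%) standing in for the voice features
X, y = make_classification(n_samples=100, weights=[0.7, 0.3], random_state=0)

# stratify=y keeps the class ratio the same in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0, stratify=y)
print("positive fraction in test set:", y_te.mean())
```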
print ('Shape of X:', features.shape)
print ('Shape of y:', target.shape)
print ('Shape of X_train:', X_train.shape)
print ('Shape of y_train:', y_train.shape)
print ('Shape of X_test:', X_test.shape)
print ('Shape of y_test:', y_test.shape)
Scikit-Learn contains the tree library, which provides built-in classes for various decision tree algorithms. Since we are performing a classification task here, we use the DecisionTreeClassifier class. The fit method of this class is called to train the algorithm on the training data, which is passed as a parameter.
# Instantiate with a max depth of 3
tree_classifier = tree.DecisionTreeClassifier(max_depth=3)
# Fit a decision tree
tree_classifier.fit(X_train, y_train)
To make predictions, the predict method of the DecisionTreeClassifier class is used.
y_pred = tree_classifier.predict(X_test)
Now we'll see how accurate our algorithm is. For classification tasks some commonly used metrics are confusion matrix, precision, recall, and F1 score. Scikit-Learn's metrics library contains the classification_report and confusion_matrix methods that can be used to calculate these metrics.
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Training accuracy
tree_classifier.score(X_train, y_train)
# Predictions/probs on the test dataset
predicted = pd.DataFrame(tree_classifier.predict(X_test))
probs = pd.DataFrame(tree_classifier.predict_proba(X_test))
# Store metrics
tree_accuracy = metrics.accuracy_score(y_test, predicted)
tree_roc_auc = metrics.roc_auc_score(y_test, probs[1])
tree_confus_matrix = metrics.confusion_matrix(y_test, predicted)
tree_classification_report = metrics.classification_report(y_test, predicted)
tree_precision = metrics.precision_score(y_test, predicted, pos_label=1)
tree_recall = metrics.recall_score(y_test, predicted, pos_label=1)
tree_f1 = metrics.f1_score(y_test, predicted, pos_label=1)
# Evaluate the model using 10-fold cross-validation on the training data
from sklearn.model_selection import cross_val_score
tree_cv_scores = cross_val_score(tree.DecisionTreeClassifier(max_depth=3),
                                 X_train, y_train, scoring='precision', cv=10)
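Cross-validation returns one score per fold, so it is useful to report the mean and the spread. A self-contained sketch on synthetic data (the real notebook would summarize tree_cv_scores the same way):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the voice training data
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                         X, y, scoring='precision', cv=10)
# One precision score per fold; report mean and standard deviation
print("CV precision: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```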
!pip install graphviz
!pip install pydotplus
!pip install dtreeviz
#!pip install StringIO
!pip install pydot
import os

# Make the Graphviz executables visible to the plotting libraries
# (path assumes a default Anaconda installation on Windows; adjust as needed)
os.environ["PATH"] += os.pathsep + "C:\\ProgramData\\Anaconda3\\Library\\bin\\graphviz"
#data frame from csv file
insurance = pd.read_csv("insurance.csv",index_col=None, na_values=['NA'],sep=',')
def map_smoking(column):
    mapped = []
    for row in column:
        if row == "yes":
            mapped.append(1)
        else:
            mapped.append(0)
    return mapped

insurance["smoker_norm"] = map_smoking(insurance["smoker"])
def map_obese(column):
    mapped = []
    for row in column:
        if row > 30:
            mapped.append(1)
        else:
            mapped.append(0)
    return mapped

insurance["obese"] = map_obese(insurance["bmi"])
# Encode sex numerically: male -> 0, female -> 1
# (mapping avoids pandas' chained-assignment warnings)
insurance["sex"] = insurance["sex"].map({"male": 0, "female": 1})
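As an aside, the loop-based helper functions above can be replaced by vectorized comparisons, which is the more idiomatic pandas style. A small sketch on toy data (the demo DataFrame is illustrative, not the real insurance data):

```python
import pandas as pd

# Toy rows standing in for the insurance data
demo = pd.DataFrame({"smoker": ["yes", "no", "yes"],
                     "bmi": [31.2, 24.5, 35.0]})

# A boolean comparison cast to int does the same job as map_smoking / map_obese
demo["smoker_norm"] = (demo["smoker"] == "yes").astype(int)
demo["obese"] = (demo["bmi"] > 30).astype(int)
print(demo)
```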
#insurance[insurance.columns]
#list of column names to keep
col_names=['charges','age','children', 'smoker_norm','obese']
#creating new filtered dataframe
new_insurance = insurance[col_names]
#filtering the dataframe to include features and another with target
features = new_insurance.drop('charges', axis=1)
targets = new_insurance['charges']
#importing our function for splitting the data and an additional cross validation function,
from sklearn.model_selection import train_test_split, cross_val_score
# Splitting the dataset randomly, with the test set containing 20% of the data
X_train_R, X_test_R, y_train_R, y_test_R = train_test_split(features, targets,
                                                            test_size=0.2, random_state=9)
The process of solving a regression problem with a decision tree in Scikit-Learn is very similar to that of classification; however, for regression we use the DecisionTreeRegressor class of the tree library. The evaluation metrics for regression also differ from those for classification.
The dataset we will use for this section is the same one we used in the Linear Regression project.
from sklearn.tree import DecisionTreeRegressor
tree_regressor = DecisionTreeRegressor(max_depth=3)
tree_regressor.fit(X_train_R, y_train_R)
y_pred_regressor = tree_regressor.predict(X_test_R)
df=pd.DataFrame({'Actual':y_test_R, 'Predicted':y_pred_regressor})
df
To evaluate the performance of a regression algorithm, the commonly used metrics are mean absolute error, mean squared error, and root mean squared error. Scikit-Learn's metrics package contains functions that calculate these values for us:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test_R, y_pred_regressor))
print('Mean Squared Error:', metrics.mean_squared_error(y_test_R, y_pred_regressor))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test_R, y_pred_regressor)))
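To judge whether these error values are good, it helps to compare against a trivial baseline that always predicts the training mean. A sketch using sklearn's DummyRegressor on synthetic data (the notebook would reuse X_train_R and y_train_R instead):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the insurance features
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=9)

# Mean-predicting baseline; a useful model should beat this RMSE
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))
print("Baseline RMSE:", rmse)
```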
# Find the best parameter to prune the tree
def dt_error(n, X_train, y_train, X_test, y_test):
    nodes = range(2, n)
    error_rate = []
    for k in nodes:
        model = tree.DecisionTreeClassifier(max_leaf_nodes=k)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        error_rate.append(np.mean(y_pred != y_test))
    kloc = error_rate.index(min(error_rate))
    print("Lowest error %s occurs at n=%s." % (error_rate[kloc], nodes[kloc]))
    plt.plot(nodes, error_rate, color='blue', linestyle='dashed', marker='o',
             markerfacecolor='red', markersize=10)
    plt.xlabel('Tree Size (max leaf nodes)')
    plt.ylabel('Test Error Rate')
    plt.show()
    return nodes[kloc]
n=dt_error(10,X_train,y_train,X_test,y_test)
print(y_pred[0:10])
print(y_test[0:10])
y_prob = tree_classifier.predict_proba(X_test)
y_prob[0:10]
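predict_proba returns one column per class, and predict() is equivalent to taking the argmax across those columns. A self-contained check on synthetic data (illustrative, not the voice dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the voice data
X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# predict() picks the class whose probability column is largest
proba = clf.predict_proba(X_te)
manual = clf.classes_[proba.argmax(axis=1)]
print(np.array_equal(manual, clf.predict(X_te)))  # True
```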
!pip install scikit-plot
The lift curve is an evaluation curve that assesses model performance: it shows how many times better than random guessing the model is at identifying positives within each fraction of the data.
To construct this curve, you can use the plot_lift_curve method from the scikitplot module together with matplotlib.pyplot. As with any model evaluation metric or curve, you need the true target values on one hand and the predicted probabilities on the other.
import scikitplot as skplt
skplt.metrics.plot_lift_curve(y_test, y_prob)
plt.show()
tree.plot_tree(tree_classifier)
from sklearn.tree import plot_tree
plot_tree(tree_classifier, filled=True)
plt.show()
from sklearn.tree import export_graphviz
export_graphviz(tree_classifier, out_file = 'tree.dot', feature_names = ['meanfun', 'IQR', 'Q25', 'sd', 'sp.ent'])
The tree is exported as a tree.dot file; we can visualize it by pasting the file's contents at http://www.webgraphviz.com/.
export_graphviz(tree_regressor, out_file = 'tree1.dot', feature_names = ['age','children', 'smoker_norm','obese'])
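As an alternative to writing a .dot file and pasting it into webgraphviz, export_graphviz can return the DOT source directly when out_file=None. A sketch on a small demo tree (the iris data is used only for self-containment; the notebook's tree_classifier works the same way):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Small demo tree
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# out_file=None makes export_graphviz return the DOT source as a string
dot_source = export_graphviz(clf, out_file=None, filled=True, rounded=True)
print(dot_source.splitlines()[0])
```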
The ROC curve is created by plotting the true positive rate against the false positive rate at various threshold settings.
skplt.metrics.plot_roc(y_test, y_prob)
plt.show()
The Kolmogorov-Smirnov (KS) statistic plot shows the maximum separation between the cumulative distributions of predicted probabilities for the two classes; a larger KS statistic indicates better class separation.
skplt.metrics.plot_ks_statistic(y_test, y_prob)
plt.show()
The precision and recall can be calculated across thresholds using the precision_recall_curve() function, which takes the true output values and the probabilities for the positive class as input and returns the precision, recall, and threshold values.
The precision-recall plot is a model-wide evaluation measure based on two basic metrics: recall and precision. Recall measures performance on the whole positive part of the dataset, whereas precision measures performance on the positive predictions.
The precision-recall plot uses recall on the x-axis and precision on the y-axis. Recall is identical to sensitivity, and precision is identical to positive predictive value.
#from sklearn.metrics import precision_recall_curve
#precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
skplt.metrics.plot_precision_recall(y_test, y_prob)
plt.show()
skplt.estimators.plot_learning_curve(tree_classifier, X_train, y_train)
plt.show()
In the graph above, the training score starts very high and decreases, while the cross-validation score starts low and increases as more training data is used, indicating the model generalizes better with more data.